White Wine Analysis by Matthew Bonilla

This report analyzes the quality of white wines and how chemical factors, such as acidity, sugar, pH, and alcohol content, affect it. The dataset covers about 4,900 wines, with 11 input variables taken from physicochemical tests and one output variable, quality, based on sensory data.

Univariate Plots Section

## [1] 4898   13
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

The dataset contains 4,898 observations and 13 columns: 12 variables plus X, which is simply the row number of the observation.

Taking a look at our output variable, quality, we can see that few wines score below 5 or above 7. No wines have a score of 0, 1, 2, or 10.
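As a rough sketch of how the missing scores could be checked, the snippet below tabulates a handful of made-up scores. Here quality_toy is a hypothetical stand-in for wines$quality, and its values are not taken from the dataset. Declaring every level from 0 to 10 keeps the unused scores visible as zeros.

```r
# Toy quality scores (made up for illustration); the real column is wines$quality.
quality_toy <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)

# Declaring all possible levels keeps unused scores in the table as zero counts.
quality_counts <- table(factor(quality_toy, levels = 0:10))
quality_counts

# Scores on the 0 to 10 scale that never occur in the toy vector.
unused <- names(quality_counts)[as.vector(quality_counts == 0)]
unused
```

Without the explicit levels, table() would silently drop the scores that never occur.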

Of the input variables, alcohol content, as a percentage, is the one most people are familiar with. The histogram peaks at around 10 percent and below, although with a median of 10.4, at least half of the wines sit above 10 percent.

The first graph didn’t tell enough of a story, so in the second graph we use a binwidth of 5. Most wines appear to have residual sugar levels below 20 grams per liter. A few are above 20, and one is above 60 grams per liter.

Fixed acidity, volatile acidity, and citric acid all have a right-skewed distribution, and all three have what look to be outliers on the higher end. There seems to be a spike in the number of wines with a citric acid of around 0.5.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600
## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##   19    7    6    2   12    5    6   12    4   12   14    1   19   17   27 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   23   33   27   49   48   70   66  104   83  181  136  219  216  282  223 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##  307  200  257  183  225  137  177  134  122  101  117   82   95   37   63 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   46   51   38   39  215   35   25   23   16   19   11   22   13   21    6 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    6    9   14    4    6    8    7    7    7    5    3    9    5    5   41 
## 0.78 0.79  0.8 0.81 0.82 0.86 0.88 0.91 0.99    1 1.23 1.66 
##    2    2    2    2    2    1    1    2    1    5    1    1

This would definitely be something to look at in the future: why are there so many wines with a citric acid of 0.49 (215 wines), when the neighboring levels of 0.48 (39 wines) and 0.5 (35 wines) are far less common?
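To put a number on the spike, we can compare the 0.49 count with its two neighbors. The counts below are copied from the frequency table above; on the full data they would come from table(wines$citric.acid). The variable names are only illustrative.

```r
# Counts copied from the frequency table above for three adjacent levels.
citric_counts <- c("0.48" = 39, "0.49" = 215, "0.5" = 35)

# Compare the spike with the mean count of its two neighbours.
neighbour_mean <- mean(citric_counts[c("0.48", "0.5")])
spike_ratio <- unname(citric_counts["0.49"] / neighbour_mean)
spike_ratio
```

The 0.49 level holds nearly six times as many wines as its neighbors on average, which is hard to explain by chance alone.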

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

We see that most chloride values lie between 0.025 and 0.075. However, the distribution has a long right tail: the 3rd quartile is 0.050, yet values reach as high as 0.346. We can transform our chlorides to investigate.

After the transformation, it is clearer where the values lie.
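The report does not say which transformation was used. A log10 scale is a common choice for right-skewed data and is assumed here. As a small sketch, we can apply it to the five-number summary printed above and compare the length of the upper tail with the lower tail, before and after. The helper tail_ratio is hypothetical and only serves this illustration.

```r
# Five-number summary of chlorides, copied from the summary output above.
chlorides <- c(min = 0.009, q1 = 0.036, median = 0.043, q3 = 0.050, max = 0.346)

# Length of the upper tail divided by the lower tail; 1 means symmetric about the median.
tail_ratio <- function(x) {
  unname((x["max"] - x["median"]) / (x["median"] - x["min"]))
}

raw_ratio <- tail_ratio(chlorides)         # long right tail on the raw scale
log_ratio <- tail_ratio(log10(chlorides))  # assumed log10 transform, closer to symmetric
c(raw = raw_ratio, log10 = log_ratio)
```

On the raw scale the upper tail is several times longer than the lower one, while on the log10 scale the two are much closer in length, which is why the transformed histogram is easier to read.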

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

We can see that free.sulfur.dioxide follows a right-skewed distribution with a peak centered around 36. total.sulfur.dioxide is also right-skewed, with a peak centered around 135. There is a secondary peak at 150, so this may be a point we should look at. In our comparison graph, there is a slight overlap between free and total sulfur dioxide. This is interesting because it shows how adding ‘bound’ sulfur dioxide to free sulfur dioxide elongates the graph of total sulfur dioxide.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Sulphates appear to be distributed around 0.5. There are a few points past 1.0 that we should note for later.

pH follows a fairly normal distribution centered around 3.15.

Density values seem to be distributed around 0.993. There seems to be at least one value well past the normal range (the maximum is 1.039).

Univariate Analysis

What is the structure of your dataset?

There are 4,898 observations containing 12 features: 11 are chemical properties and 1 is a quality ranking from 0 to 10, with 10 being the best. Most of the variables appear to follow a roughly normal distribution, several with a right skew. This may be the reason why the quality scores also follow a normal distribution.

What is/are the main feature(s) of interest in your dataset?

The main features are quality and some measure of acidity. From the univariate plots, we can see that certain variables have extreme values, so those are key points we want to look at, especially since quality is normally distributed around 6 and has few extremes.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

A large portion of the investigation will compare fixed and volatile acidity, along with citric acid and residual sugar levels. I believe these affect the taste of the wine the most.

Did you create any new variables from existing variables in the dataset?

From the univariate data, I did not see a reason to create any new variables. The wines data dictionary states that total.sulfur.dioxide is a combination of the free and bound forms of SO2; however, we do not know the correct calculation to create variables from it.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

This dataset was clean. I did have to perform adjustments to the graphs to show the extremes of certain values such as chlorides but other than that, each value for each variable was consistent in how it should be presented.

Bivariate Plots Section

First, let’s look at how the variables relate to one another.

From this, we can see that no single chemical property clearly indicates the quality of a wine. There is no strong correlation between any of the 11 variables and quality. The strongest is alcohol content, with a correlation of .436.
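One way to pull the correlations with quality out of a full correlation matrix is sketched below. The data frame wines_toy and its values are made up for illustration and use only three of the inputs; with the real data, wines would take its place.

```r
# Toy stand-in for the wines data frame (values are made up for illustration).
wines_toy <- data.frame(
  alcohol = c(9.0, 9.5, 10.4, 11.4, 12.5, 10.0),
  density = c(0.998, 0.996, 0.994, 0.992, 0.990, 0.995),
  pH      = c(3.20, 3.05, 3.18, 3.30, 3.10, 3.25),
  quality = c(5, 5, 6, 6, 8, 5)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(wines_toy)

# Keep the quality column, drop quality's correlation with itself,
# and sort by absolute strength so the strongest predictor comes first.
quality_cor <- cor_matrix[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]
quality_cor <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
quality_cor
```

Sorting by absolute value keeps strong negative relationships from being buried below weak positive ones.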

While it is interesting that wines with a quality of 9 only have an alcohol content greater than 10, we cannot draw a conclusion from such a small sample size. This graph does not tell us much other than that wines of each quality range widely in alcohol content.

From this box plot, we can more clearly see that alcohol alone does not separate the lower and middle qualities. The median alcohol content for a quality of 3 and a quality of 6 both fall around 10.5.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.35   11.00   12.60
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

Specifically, wines with a quality of 3 have a median alcohol content of 10.45 and wines with a quality of 6 have a median of 10.5. Wines with a quality of 9 have a median of 12.5 and a mean of 12.18 percent, but this alone does not mean that a higher alcohol content means a higher quality. There are only five values within this subset, and with the mean falling below the box of the boxplot, it is strongly affected by a low outlier. If we had more wines with a quality of 9, we might find that these values are not representative of the whole.
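The printed summary for quality 9 can be reproduced from five values. This is an inference from the summary itself, not something the report shows directly: with five observations, R's default quantiles are exactly the sorted values. The sketch below shows how much a single low value pulls the mean.

```r
# Five values that reproduce the printed summary for quality 9
# (assuming the subset has five wines; with n = 5 the default
# quantiles are exactly the sorted values).
alcohol_q9 <- c(10.4, 12.4, 12.5, 12.7, 12.9)

summary(alcohol_q9)

mean_all <- mean(alcohol_q9)                       # pulled down by the 10.4 wine
mean_trimmed <- mean(alcohol_q9[alcohol_q9 > 11])  # mean without that wine
c(all = mean_all, without_lowest = mean_trimmed)
```

Dropping the single 10.4 wine moves the mean from 12.18 to 12.625, back inside the box, which illustrates how fragile summaries of such a small group are.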

cor(wines$quality, wines$fixed.acidity)
## [1] -0.1136628
cor(wines$quality, wines$volatile.acidity)
## [1] -0.194723
cor(wines$quality, wines$citric.acid)
## [1] -0.009209091

Using cor(), we can see how the various acid and acidity levels in wines relate to quality. From the top: fixed acidity, volatile acidity, and citric acid.

Our previous assumption that the acidity of a wine would drive quality seems to be faltering. The thought was that as acidity grew, quality would decrease. With a correlation of about -.1 for fixed acidity, we could say we were heading in the right direction, but the data dictionary only stated that taste would degrade “in high levels” of acidity.

Since it’s only high levels, let’s take a look at the various acidity levels above a certain threshold.

Taking a look at the top 10% of fixed acidity, we do see a slight trend where wines of a higher quality have lower and lower mean fixed acidity.
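A minimal sketch of this kind of filter, assuming "top 10%" means rows at or above the 90th percentile of fixed.acidity. The data frame wines_toy is randomly generated for illustration only, so the means it prints say nothing about real wines.

```r
# Toy stand-in for the wines data frame (randomly generated, not real data).
set.seed(1)
wines_toy <- data.frame(
  fixed.acidity = round(runif(200, min = 4, max = 14), 1),
  quality       = sample(3:9, 200, replace = TRUE)
)

# Keep only rows whose fixed acidity is at or above the 90th percentile.
cutoff <- quantile(wines_toy$fixed.acidity, probs = 0.90)
top_acid <- subset(wines_toy, fixed.acidity >= cutoff)

# Mean fixed acidity per quality level within that subset.
tapply(top_acid$fixed.acidity, top_acid$quality, mean)
```

With the real data, replacing wines_toy with wines would give the per-quality means that the boxplot marks.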

Here, we look at the whole range of fixed acidity. While it follows the same trend, the trend looks less clear as medians and means settle around 7.5. Let’s do the same for volatile acidity and citric acid.

Like its name, volatile acidity is truly volatile in its results. There is no clear trend on how volatile acidity affects quality.

Unfortunately, the same holds true for citric acid and quality of wine. There is not much of a correlation and these boxplots show that. Here, we expected to see an increase of quality as citric acid increased since it was supposed to add flavors to wines. However, it seems that the amount of citric acid does not help.

It looks as if there is a trend appearing. Let’s make it easier to see clusters by removing the top percentages.

We can see an upward trend: as free sulfur dioxide increases, so does total sulfur dioxide. We expect this from the description of free and total sulfur dioxide.

Initially looking at this, we can see a clear trend in that as the amount of alcohol increases, the density of the wine decreases.

If we limit our view to density values less than 99%, we can definitely see the trend. Chemically this makes sense and reinforces the connection between water levels and alcohol.

Taking a quick glance, it doesn’t look like there’s much between pH levels and the density of the wine.

This graph further shows the lack of a relationship between pH and density. While they may not relate to each other now, together they could possibly show a trend for quality of wines.

The relationship between residual sugar and density is surprising since it was not noted in the data dictionary, but it does make sense chemically: the more sugar you add to a liquid, the denser the product gets. Let’s take a more in-depth look.

## [1] 0.8389665

We can more clearly see the relationship between residual sugar and density. Using the cor() function, we can see a .839 correlation between the two. Individually, they may not show a correlation with quality, but a multivariate analysis may reveal one.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

A big observation was that no single variable correlated strongly with the quality of wine. This makes sense because otherwise, wine makers would have an easier time creating good wine. We have this multitude of variables because it takes all of them to make good wine.

It was also nice to see confirmation of relationships that the data dictionary stated, such as acidity. However, it was shocking that acidity did not affect wine quality as much as I expected. I was also shocked to see that residual sugar did not have a stronger correlation with quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I think the most interesting relationship I found was between density and residual sugar levels. It makes sense when given some thought, but initially it may be hard to see.

What was the strongest relationship you found?

With a correlation of .839, density and residual.sugar had the strongest relationship. Individually, compared with quality, neither shows a strong relationship, but I believe together they may be able to provide guidance.

Multivariate Plots Section

When we add the quality of each point to the density vs residual sugar graph, we can see an interesting separation. The points in the lower half are mostly wines with high quality. Very few low quality wines are in the lower half. This makes sense since we would like white wines to have a lighter taste. When sugar levels are low, quality is carried by the lightness of the drink. As sugar levels rise, density and quality become muddled and there is less of a clear distinction. The regression lines depict this as well. Qualities of 7, 8, or 9 all start from a low density, while lower quality wines start from a higher density.

We saw earlier that as alcohol increases, so does the quality of wine, which is shown here. We can also see that as density decreases, quality also tends to rise. This is clear from the observation earlier where we compared density and alcohol content. What is interesting is this fade-off of quality 5 and lower wines past 11% alcohol content and below .99 density. These lower quality wines dominate when alcohol content is below 10 but suddenly seem to disappear.

The initial graph showed that pH and density did not have much of a correlation. With this, it’s interesting to see that the lower quality wines group around the same area instead of being equally distributed.

The linear models fitted for each quality level between free and total sulfur dioxide vary widely, with large 95% confidence bands for qualities of 3, 4, 8, and 9.

Looking at the graphs, we can see that wines with a quality of less than 5 usually tend to have lower free sulfur dioxide. A large majority is below 40 with less and less as free sulfur dioxide grows. Wines with a quality of 5 or 6 tend to have a wide spread between both free sulfur dioxide and total sulfur dioxide. It looks like the ratio of total sulfur dioxide to free sulfur dioxide plays a role in the quality since those with lower ratios tend to be 6. Wines with a quality of 7 or higher tend to now have total sulfur dioxide below 50 and free sulfur dioxide below 20.

For lower quality wines, it looks as if our assumption is correct: wines with a lower total sulfur dioxide to free sulfur dioxide ratio tend to have a higher quality. Maybe comparing sulfur dioxide to other variables will provide clearer insight.

## [1] -0.2191773

Using with(wines, cor(total.free, quality)) shows that the ratio correlates with quality more strongly than either variable by itself. The correlation of total.free with quality is -.219, compared to -.175 for total sulfur dioxide and .00816 for free sulfur dioxide.
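The report does not show how total.free was built. The sketch below assumes it is total.sulfur.dioxide divided by free.sulfur.dioxide, and compares the correlation of the ratio with that of its two components. The data frame wines_toy holds made-up values, so the printed correlations are illustrative only.

```r
# Toy stand-in for the wines data frame (values are made up for illustration).
wines_toy <- data.frame(
  free.sulfur.dioxide  = c(10, 25, 34, 46, 60, 30),
  total.sulfur.dioxide = c(120, 150, 134, 167, 180, 90),
  quality              = c(4, 5, 6, 6, 7, 8)
)

# Assumed definition of the ratio: total divided by free sulfur dioxide.
wines_toy$total.free <- with(wines_toy, total.sulfur.dioxide / free.sulfur.dioxide)

# Correlation of each candidate predictor with quality, side by side.
cors <- sapply(
  wines_toy[c("free.sulfur.dioxide", "total.sulfur.dioxide", "total.free")],
  cor, y = wines_toy$quality
)
cors
```

Putting the three correlations in one vector makes it easy to see whether the derived ratio adds anything over its components.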

This graph was a surprising find since it shows a small grouping of qualities 5, 6, and 7. A cluster of qualities of 5 stay below 10% alcohol content, qualities of 6 are above 10% and below 11.5% and qualities of 7 are above 11.5%. There are of course values that break this trend but the clustering of colors shows distinct patterns. Does the ratio of alcohol to sulphates correlate strongly with qualities of wine?

## [1] 0.1753643

Unfortunately, the ratio of alcohol/sulphates does not correlate better than alcohol by itself. The correlation went from .436 to .175.

While we would not be able to see a trend between pH and chlorides alone, when we overlay quality, we can see a slight trend: higher quality wines tend to have lower chlorides, while medium quality wines can be seen having higher chlorides with a wider spread of pH.

Using a simple boxplot, we can see a small trend of average pH/chlorides increasing as the quality of wine increases.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

It was easier to see how quality affected the graphs where neither the x nor the y axis was quality. My first and favorite was residual sugar vs density. You could clearly see a line where quality changed based on the level of density vs residual sugar. It was clear how qualities would cluster together depending on the variables.

Other features of interest, such as alcohol, were strengthened somewhat by features that relate strongly to alcohol, for example density, since their values are tied to each other. Here, we could see where quality was clearly separated in the density vs alcohol graphs. Otherwise, alcohol paired with variables that did not relate to it fared worse.

Were there any interesting or surprising interactions between features?

An interesting feature was how density and residual sugar affected quality of wines. I think seeing the clear line was very exciting for me and enabled me to seek out other comparable features.

Another surprising one was how the total.sulfur.dioxide vs. total/free sulfur dioxide graph looked. I expected something structural like the free.sulfur.dioxide vs. total/free but it looked very seashell like. I think it was interesting to see how some values ended up making a diagonal line in varying slopes. It definitely was not how I expected the graph to look.


Final Plots and Summary

Plot One

Description One

I find this boxplot interesting because, while it is simple, it quickly and easily supports the idea that the more acidic a wine is, the lower its quality. We can see the medians and overall quantiles decreasing as quality increases. I have added a mean marker for each quality, and it shows a similar decreasing trend as quality increases.

A key point about this graph is that it only represents the wines with a fixed.acidity value in the top 10%. This is done to strengthen the visualization of change between each quality rank and its fixed acidity level. Without this filter, there is still a trend, but the graph displays it on a much smaller scale.

Plot Two

Description Two

This scatterplot was my favorite during this analysis because it showed the starkest contrast between quality levels and how the variables could affect quality. First, this scatterplot shows how density and residual sugar relate to each other. As residual sugar levels rise, so does density. What makes this visualization stand out is how clearly wines with quality <= 5 are separated from wines with quality >= 6. Of course there are one-off instances that fall in a different area, but overall, the clusters of qualities add to the effect of the graph.

I also added a linear model. I expected the trend line as shown, but not for it to form such a clear wall between the two qualities (5 and 6) of wines. As residual sugar levels go higher, the two ‘sides’ do meet at the ‘point’ of the graph.

The visualization is limited to only the bottom 99.9% since there are extreme outliers that extend the limits past a decent view point.

Plot Three

Description Three

This scatterplot is interesting because I think the relationship between alcohol and quality is clearer when viewed this way. Sulphates, in this graph, could have been swapped for another variable and the graph would still have displayed the same idea. The graph of alcohol and sulphates, however, presents it in a nice manner where each value lines up on a grid. I changed the shapes to squares so that they align more neatly than circles.

From here, we can see wines of 5 quality usually have an alcohol content less than 10. We can also see wines of 7 quality tend to have an alcohol content more than 11. We can also see that a lot of the wines of quality 8 tend to have alcohol content more than 12. This is in line with the fact that alcohol has the highest correlation to quality compared to the other variables. As stated before, if we replaced sulphates with most other variables, it would still show the same increase of quality as alcohol increases.


Reflection

In the White Wines data set, I expected to find a clear-cut way to decide whether a wine would be of high quality, but that was definitely not the case. I was shocked to see that alcohol content did slightly correlate to quality. It was also nice to see how graphs came together. I also found that R Documentation actually ended up helping a lot with creating graphs. I think I’ll take what I learned and be more consistent with my graphs from now on.

A lot of the struggle was figuring out which variables would work best. If I had the bandwidth, it would have been interesting to look at ratios of each pair of variables, or other ways to combine variables to predict quality, such as products or sums. I think I could also try to create a linear model, but that goes back to figuring out which variables would go well together.
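As a starting point for that linear model, a minimal sketch with lm() might look like the following. The choice of predictors (alcohol and the total.free ratio) is only an assumption based on the correlations seen earlier, and wines_toy holds made-up values, so the fitted coefficients mean nothing for real wines.

```r
# Toy stand-in for the wines data frame (values are made up for illustration).
wines_toy <- data.frame(
  alcohol    = c(9.0, 9.5, 10.0, 10.5, 11.0, 12.0, 12.5),
  total.free = c(6.0, 5.0, 4.5, 4.0, 3.5, 3.0, 2.5),
  quality    = c(5, 5, 6, 6, 6, 7, 7)
)

# Ordinary least squares fit of quality on the two assumed predictors.
fit <- lm(quality ~ alcohol + total.free, data = wines_toy)

coef(fit)               # intercept and one slope per predictor
summary(fit)$r.squared  # share of variance in quality explained by the fit
```

Swapping predictors in and out of the formula and comparing the resulting R-squared values would be one simple way to explore which variables go well together.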